Adjusting for Dropout Variance in Batch Normalization and Weight Initialization
Abstract
We show how to adjust for the variance introduced by dropout with corrections to weight initialization and Batch Normalization, yielding higher accuracy. Though dropout can preserve the expected input to a neuron between train and test, the variance of the input differs. We thus propose a new weight initialization by correcting for the influence of dropout rates and an arbitrary nonlinearity’s influence on variance through simple corrective scalars. Since Batch Normalization, with dropout, estimates the variance of a layer’s incoming distribution with some inputs dropped, the variance also differs between train and test. After training a network with Batch Normalization and dropout, we simply update Batch Normalization’s variance moving averages with dropout off and obtain state-of-the-art results on CIFAR-10 and CIFAR-100 without data augmentation.
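The train/test variance mismatch described above can be demonstrated in a few lines of NumPy. The sketch below is a minimal illustration under stated assumptions (standard-normal inputs, inverted dropout); the function and variable names are ours, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, rate, train):
    """Inverted dropout: at train time, drop units with probability
    `rate` and scale the survivors by 1/(1 - rate); at test time,
    pass the input through unchanged."""
    if not train or rate == 0.0:
        return x
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

# Toy layer-input distribution: zero mean, unit variance per feature.
X = rng.normal(size=(10_000, 64))
rate = 0.5

# Variance Batch Normalization would estimate during training
# (dropout on) versus what it actually sees at test time (dropout off).
train_var = dropout(X, rate, train=True).var(axis=0).mean()
test_var = dropout(X, rate, train=False).var(axis=0).mean()

# Inverted dropout preserves the mean but inflates the variance by
# roughly 1/(1 - rate), so BN's moving variance is too large at test
# time. The paper's remedy: after training, re-estimate BN's variance
# moving averages with dropout disabled.
print(train_var, test_var)
```

With rate 0.5, the train-time variance is roughly double the test-time variance, which is exactly the gap the post-training moving-average update closes.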
Related Papers
Generalizing and Improving Weight Initialization
We propose a new weight initialization suited for arbitrary nonlinearities by generalizing previous weight initializations. The initialization corrects for the influence of dropout rates and an arbitrary nonlinearity’s influence on variance through simple corrective scalars. Consequently, this initialization does not require computing mini-batch statistics nor weight pre-initialization. This si...
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Training Deep Neural Networks is complicated by the fact that the distribution of each layer’s inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal co...
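The normalization step at the heart of this method is simple to state: standardize each feature over the mini-batch, then apply a learned affine transform. A minimal training-mode forward pass (our own sketch, omitting the moving-average bookkeeping used at test time):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch Normalization, training-mode forward pass: normalize each
    feature to zero mean / unit variance over the mini-batch, then
    apply the learned affine transform gamma * x_hat + beta."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=5.0, size=(128, 16))  # shifted, scaled inputs
out = batch_norm(x, gamma=np.ones(16), beta=np.zeros(16))
# Each feature of `out` is now approximately zero-mean, unit-variance.
print(out.mean(axis=0).max(), out.std(axis=0).mean())
```

At test time a real implementation substitutes moving averages of `mean` and `var` for the batch statistics, which is precisely the estimate that dropout distorts in the main paper above.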
Active Bias: Training a More Accurate Neural Network by Emphasizing High Variance Samples
Self-paced learning and hard example mining re-weight training instances to improve learning accuracy. This paper presents two improved alternatives based on lightweight estimates of sample uncertainty in stochastic gradient descent (SGD): the variance in predicted probability of the correct class across iterations of minibatch SGD, and the proximity of the correct class probability to the deci...
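The first of these uncertainty estimates can be illustrated as follows. The data here is synthetic and the weighting scheme is a simplified stand-in for the paper's estimator, shown only to make the idea concrete:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical history: each sample's predicted probability for its
# correct class over the last k mini-batch SGD iterations.
k, n = 5, 100
probs_history = rng.uniform(size=(k, n))  # (iterations, samples)

# Samples whose correct-class probability fluctuates a lot across
# iterations are "uncertain"; emphasize them with larger weights.
uncertainty = probs_history.var(axis=0)
weights = uncertainty / uncertainty.sum()  # normalized sampling weights
print(weights.shape)
```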
Understanding the Disharmony between Dropout and Batch Normalization by Variance Shift
This paper first answers the question “why do the two most powerful techniques Dropout and Batch Normalization (BN) often lead to a worse performance when they are combined together?” in both theoretical and statistical aspects. Theoretically, we find that Dropout would shift the variance of a specific neural unit when we transfer the state of that network from train to test. However, BN would ...
Convolutional Sequence to Sequence Learning
A. Weight Initialization We derive a weight initialization scheme tailored to the GLU activation function similar to Glorot & Bengio (2010); He et al. (2015b) by focusing on the variance of activations within the network for both forward and backward passes. We also detail how we modify the weight initialization for dropout. A.1. Forward Pass Assuming that the inputs x_l of a convolutional laye...
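To see why GLU needs its own initialization, one can measure how it transforms activation variance empirically. A minimal sketch (our own, assuming zero-mean unit-variance inputs; not the appendix's derivation):

```python
import numpy as np

def glu(x):
    """Gated Linear Unit: split the features in half and gate one half
    with the sigmoid of the other, output = a * sigmoid(b)."""
    a, b = np.split(x, 2, axis=-1)
    return a / (1.0 + np.exp(-b))

rng = np.random.default_rng(0)
x = rng.normal(size=(100_000, 8))  # unit-variance, zero-mean inputs
out = glu(x)

# GLU shrinks the activation variance substantially (empirically to
# roughly 0.3 for these inputs), which is the effect a tailored
# initialization must compensate for, before further accounting
# for dropout.
print(out.var())
```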